Name | Version | Summary | Date |
kreuzberg | 3.9.1 | Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats | 2025-07-29 15:54:53 |
docling-analysis-framework | 1.1.0 | AI-ready analysis framework for PDF and Office documents using Docling for content extraction | 2025-07-29 14:34:10 |
xml-analysis-framework | 1.3.0 | XML document analysis and preprocessing framework designed for AI/ML data pipelines | 2025-07-29 14:32:08 |
document-data-extractor | 1.0.4 | Best open-source document to markdown extractor for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract | 2025-07-29 08:25:56 |
qdrant-loader | 0.5.1 | A tool for collecting and vectorizing technical content from multiple sources and storing it in a QDrant vector database. | 2025-07-29 06:41:31 |
contextgem | 0.12.1 | Effortless LLM extraction from documents | 2025-07-27 20:11:08 |
aikitx | 1.0.0 | A comprehensive GUI toolkit for Large Language Models (LLMs) with GGUF support, document processing, email automation, and multi-backend inference | 2025-07-25 19:44:31 |
llm-data-converter | 2.2.0 | Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract | 2025-07-25 13:32:07 |
llm-text-splitter | 0.2.0 | A lightweight, rule-based text splitter for LLM context window management, handles multiple file formats and enriches chunks with metadata. | 2025-07-24 12:21:01 |
mseep-kreuzberg | 3.8.2 | Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats | 2025-07-17 03:32:28 |
pdf-splitter-cli | 0.1.1 | A modern command-line tool to split PDF files into smaller chunks with progress bars and automatic filename generation | 2025-07-17 01:37:12 |
pdf-ocr-processor | 2.0.3 | Advanced PDF OCR processing with AI-powered text extraction and selectable text overlays | 2025-07-11 21:11:24 |
ai-chunking | 0.1.4 | A powerful Python library for semantic document chunking and enrichment using AI | 2025-03-16 20:44:19 |
atai-pdf-tool | 0.1.0 | A tool for parsing and extracting text from PDF files with OCR capabilities | 2025-02-27 11:15:46 |
smart-llm-loader | 0.1.0 | A powerful PDF processing toolkit that seamlessly integrates with LLMs for intelligent document chunking and RAG applications. Features smart context-aware segmentation, multi-LLM support, and optimized content extraction for enhanced RAG performance. | 2025-02-14 12:42:55 |
fileseek | 0.1.3 | FileSeek – AI-Powered Local Document Archive&Search | 2025-02-08 07:13:54 |
tikara | 0.1.5 | The metadata and text content extractor for almost every file type. | 2025-01-26 23:33:40 |
peslac | 0.1.4 | A Python package for the Peslac API | 2025-01-25 06:54:20 |
aimq | 0.1.0 | A robust message queue processor for Supabase pgmq with AI-powered document processing capabilities | 2025-01-18 22:17:05 |
pdf-parser-header-footer | 0.1.0 | A Python package for processing PDFs with header and footer detection | 2025-01-14 16:10:34 |